A blog post illustrating the key techniques of gradient descent
Author
Mia Tarantola
Published
March 9, 2023
Introduction
We are looking at the use of gradient descent for optimization and the logistic loss problem. In this assignment I implement gradient descent for logistic regression, implement stochastic gradient descent, and perform several experiments that test the limits of these algorithms.
The fit method begins by generating a random weight vector of size (number of features) + 1. We then append a column of ones to X to create our X_ array; the extra column lets the last weight act as the bias term. We set the initial previous loss to infinity so it is guaranteed to update after the first epoch. Then, until the loss converges or we reach the maximum number of epochs, we do the following:
- update w by \(w \leftarrow w - \alpha \nabla L(w)\), where \(L(w)\) is the logistic loss computed on X_ and y
- calculate the loss of the current state
- update the loss history and score history
- set the previous loss equal to the new loss
The loss, predict, score, and sigmoid functions were adapted from the lecture notes and the perceptron blog post.
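Putting these pieces together, a minimal sketch of the approach might look like the following. This is an illustration rather than the exact implementation: the helper names (sigmoid, gradient, loss_history, score_history) and the convergence check are assumptions based on the description above.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

class LogisticRegression:

    def loss(self, X_, y):
        # empirical risk: average logistic (cross-entropy) loss
        p = sigmoid(X_ @ self.w)
        return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

    def gradient(self, X_, y):
        # gradient of the logistic loss with respect to w
        return (X_ * (sigmoid(X_ @ self.w) - y)[:, np.newaxis]).mean(axis = 0)

    def predict(self, X):
        X_ = np.append(X, np.ones((X.shape[0], 1)), axis = 1)
        return (X_ @ self.w > 0).astype(int)

    def score(self, X, y):
        # accuracy on (X, y)
        return (self.predict(X) == y).mean()

    def fit(self, X, y, alpha = 0.1, max_epochs = 1000):
        # append a column of ones so the last entry of w acts as the bias
        X_ = np.append(X, np.ones((X.shape[0], 1)), axis = 1)
        self.w = np.random.rand(X_.shape[1])          # random initial weight vector
        self.loss_history, self.score_history = [], []
        prev_loss = np.inf                            # guarantees an update after the first epoch

        for _ in range(max_epochs):
            self.w -= alpha * self.gradient(X_, y)    # gradient descent step
            new_loss = self.loss(X_, y)
            self.loss_history.append(new_loss)
            self.score_history.append(self.score(X, y))
            if np.isclose(new_loss, prev_loss):       # loss has converged
                break
            prev_loss = new_loss
```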
```python
def draw_line(w, x_min, x_max):
    x = np.linspace(x_min, x_max, 101)
    y = -(w[0]*x + w[2])/w[1]
    plt.plot(x, y, color = "black")

fig1 = plt.scatter(X1[:,0], X1[:,1], c = y1)
fig1 = draw_line(LR1.w, -2, 2)
xlab = plt.xlabel("Feature 1")
ylab = plt.ylabel("Feature 2")
```
The stochastic fit function begins similarly to the original gradient descent function:
- create X_
- generate a random weight vector of size (number of features) + 1
- set the previous loss to infinity
Then we iterate through the following until we reach the maximum number of epochs or the loss converges:
- shuffle the points randomly
- take the first k points and update the weight vector using the stochastic gradient computed on that batch
- move on to the next k points and repeat until every point has been used
- update the loss and score history
- reshuffle the points randomly and begin the next epoch
The gradient function is the same as the original one; we simply pass a subset (batch) of X_ and y into it repeatedly.
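As a rough sketch, and continuing the class from the earlier sketch, the stochastic fit might look like the following. The batch_size parameter name and the once-per-epoch convergence check are assumptions, not the exact implementation.

```python
import numpy as np

def fit_stochastic(self, X, y, alpha = 0.1, batch_size = 10, max_epochs = 1000):
    # same setup as the regular fit method
    X_ = np.append(X, np.ones((X.shape[0], 1)), axis = 1)
    n = X_.shape[0]
    self.w = np.random.rand(X_.shape[1])
    self.loss_history, self.score_history = [], []
    prev_loss = np.inf

    for _ in range(max_epochs):
        order = np.random.permutation(n)              # shuffle the points randomly
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]   # next k points of the shuffled data
            # same gradient as before, just computed on a small batch
            self.w -= alpha * self.gradient(X_[batch, :], y[batch])

        new_loss = self.loss(X_, y)                   # track loss on the full data once per epoch
        self.loss_history.append(new_loss)
        self.score_history.append(self.score(X, y))
        if np.isclose(new_loss, prev_loss):           # loss has converged
            break
        prev_loss = new_loss

# attach to the class from the earlier sketch
LogisticRegression.fit_stochastic = fit_stochastic
```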
For these settings, we can see that the stochastic methods perform much better than the regular gradient descent method. All three methods show a smooth decline in loss, but stochastic gradient descent has the fastest decline. I believe the plain gradient descent method would need more epochs to find a better solution.
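For reference, a comparison like this can be plotted directly from the loss histories stored during fitting; the model variable names below are placeholders, not the actual experiment's objects.

```python
import matplotlib.pyplot as plt

# LR_gd and LR_sgd are placeholder names for models fit with the
# regular and stochastic methods on the same data
plt.plot(LR_gd.loss_history, label = "gradient descent")
plt.plot(LR_sgd.loss_history, label = "stochastic gradient descent")
plt.xlabel("epoch")
plt.ylabel("empirical risk (logistic loss)")
plt.legend()
plt.show()
```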
```python
# alpha is too large
LR3 = LogisticRegression()
LR3.fit(X3, y3, alpha = 200, max_epochs = 10000)

# alpha is a normal value
LR5 = LogisticRegression()
LR5.fit(X3, y3, alpha = .01, max_epochs = 10000)
```
We can see that both alphas result in good separators. However, we have discovered one of the caveats of logistic regression: it can compensate for an alpha that is too large by scaling up all of the weights. So for data that is linearly separable, the model performs about as well as you could want it to. Hence, we will now investigate data that is not linearly separable.
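As a quick illustration of why this happens, here is a small numerical check on toy data (using the loss from the earlier sketch, not the actual experiment): scaling up a separating weight vector keeps driving the logistic loss toward zero, so even wildly large steps can still land on a good separator.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_loss(w, X_, y):
    p = sigmoid(X_ @ w)
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

# toy linearly separable data; the second column is the constant bias feature
X_ = np.array([[-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

w = np.array([1.0, 0.0])   # a weight vector that separates the two classes

for c in [1, 10, 100]:
    print(f"scale {c:>3}: loss = {logistic_loss(c * w, X_, y):.2e}")
# the loss keeps shrinking as the weights grow, so there is no finite minimizer
```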
Here we can see that our model performs fairly well with a smaller alpha (1), as the line mostly separates the data. However, with a larger alpha the algorithm does not work well: it is unable to converge, the resulting separator is poor, and the loss bounces around.
For this experiment I compared a stochastic batch size of 8 to a batch size of 80. We can see that the smaller batch size allows the algorithm to converge much faster: increasing the batch size by a factor of 10 (from 8 to 80) increased the number of epochs needed by almost a factor of 10.
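A sketch of how this comparison might be set up with the fit_stochastic sketch from above; X_train, y_train, and the alpha value here are placeholders, not the experiment's actual data or settings.

```python
# X_train and y_train are placeholders for the experiment's data
LR_small = LogisticRegression()
LR_small.fit_stochastic(X_train, y_train, alpha = 0.1, batch_size = 8, max_epochs = 10000)

LR_large = LogisticRegression()
LR_large.fit_stochastic(X_train, y_train, alpha = 0.1, batch_size = 80, max_epochs = 10000)

# the fit loop stops once the loss converges, so the length of the
# loss history is the number of epochs each run needed
print(len(LR_small.loss_history), len(LR_large.loss_history))
```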